Let’s look at the dependent variable Y18JBERNA_06:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 1026 1519 1661 2089 10748 510
## Warning: Removed 510 rows containing non-finite values (stat_bin).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -43.37 1026.28 1501.29 1665.58 2055.32 9326.78
NOTE: See the negative values for earnings! Not a huge problem but still worth noticing.
NOTE: The count is higher because the test set has 10000 obs while the real data only has 1254.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -61.96 1027.57 1497.08 1656.09 2031.32 9278.14
NOTE: See the negative values for earnings! Not a huge problem but still worth noticing.
NOTE: The count is higher because the test set has 10000 obs while the real data only has 1254.
So, we do not see anything particularly bad happening when we
generate the dependent variable using queens on which CDML is failing,
except for the negative values. And those are easy to explain: we
compute Y1 = Y0 + tau, so whenever tau is negative and its magnitude
exceeds Y0 (or Y0 is already 0 and tau is negative), Y1 goes below zero.
So, let’s look at the tau for this variable to see whether something is going on there.
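The sign arithmetic can be sketched as follows (a minimal illustration with made-up numbers, not values from the data):

```python
# Sketch: Y1 = Y0 + tau goes negative when tau is negative
# and its magnitude exceeds Y0 (hypothetical values).
y0 = 100.0     # baseline earnings
tau = -250.0   # a large negative treatment-effect draw
y1 = y0 + tau  # Y1 = Y0 + tau
print(y1)      # negative "earnings"
```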
## V1
## Min. :-644.52
## 1st Qu.: 22.33
## Median : 140.53
## Mean : 152.93
## 3rd Qu.: 277.24
## Max. :1236.66
## s1
## Min. :-506.3
## 1st Qu.: 102.4
## Median : 117.3
## Mean : 135.6
## 3rd Qu.: 152.4
## Max. : 760.8
NOTE: Not perfect, but not that bad either?
First, let’s see how our CDML performs by queen when aggregated using
sqrt(mean(value^2)), and make sure the numbers are the same
as we saw before:
Yes, the numbers are the same as expected, check passed.
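The aggregation used for that check can be sketched like this (Python, purely illustrative; the error values are made up, while the real computation runs over the per-observation simulation errors):

```python
import math

# Root-mean-square aggregation: sqrt(mean(value^2)).
# This is the same formula applied to the simulation errors above.
def rmse(values):
    return math.sqrt(sum(v ** 2 for v in values) / len(values))

errors = [50.0, 120.0, 90.0]  # hypothetical per-observation errors
print(round(rmse(errors), 2))
```

Note that squaring before averaging is exactly why a single huge error can dominate the aggregate, which matters for everything below.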
Those are aggregated numbers; let’s go back to our simulation data by observation in the test set (10000 obs) and see how RMSE behaves when the test set was created using the RF MOM DR queen:
So here is our RMSE for all 10000 test observations created using the RF MOM DR queen:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 45.96 99.15 135.08 175.16 194.80 107722.64
Obviously we see that 107722.64 is a huge outlier.
Let’s look at the distribution too:
Let’s look at a table of when RMSE is > 1000:
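The filter behind that table can be sketched as follows (Python; the observation ids and all values except the outlier reported above are hypothetical):

```python
# Sketch: flag test observations whose per-observation RMSE exceeds 1000.
# Only the extreme observation survives the filter.
rmse_by_obs = {5763: 107722.64, 12: 135.08, 48: 94.50}  # mostly made-up values
outliers = {obs: r for obs, r in rmse_by_obs.items() if r > 1000}
print(outliers)
```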
What happened with observation 5763 in the test set? Why does CDML have an RMSE of 107722.64 for it?
Let’s look at the data for this observation in the test set:
Yes, earnings for this row are higher than average, but nothing seems very wrong with it. So why did CDML fail to predict tau for it? To find out we would need the data broken down by simulation run, which I do not have yet. What we know is that, for this particular test observation, CDML missed by so much in one or more of the 50 simulations that the RMSE across those 50 runs becomes an outlier.
But more importantly, how do we protect ourselves from these outliers?
Does the same observation in the test set cause trouble when Lasso MCM EA is the queen?
So here is our RMSE for all 10000 test observations created using the Lasso MCM EA queen:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 42.30 71.63 87.87 165.58 166.96 140376.65
Again, 140376.65 is a huge outlier.
Let’s look at the distribution too:
Let’s look at a table of when RMSE is > 1000:
No, this time a different observation in the test set is causing trouble: 7543.
What happened with observation 7543 in the test set? Why does CDML have an RMSE of 140376.65 for it?
Let’s look at the data for this observation in the test set:
Again nothing seems to be wrong with it.
So in both cases the aggregated CDML stats are skewed by ONE bad prediction. We could go further: disaggregate by simulation run for that particular test observation and see which runs produce the outliers. What we would probably find is that for some simulations the train set is a bad draw and the model underfits.
But what can we do about these outliers in general? There will always be a chance of a bad train-set draw. Should we remove the outliers? That is, what happens if I remove these two bad rows from our disaggregated results and then aggregate again (over 9998 observations instead of 10000)?
I removed two bad rows, aggregated again using sqrt(mean(value^2)) and CDML is back to normal!
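The effect of dropping the bad rows can be sketched with toy numbers (Python; the values are made up, only the mechanism matches what happens in the real aggregation):

```python
import math

# Same sqrt(mean(value^2)) aggregation as used for the results above.
def rms(values):
    return math.sqrt(sum(v ** 2 for v in values) / len(values))

# One extreme per-observation RMSE dominates the aggregate (made-up values).
per_obs = [100.0] * 9 + [100000.0]
print(round(rms(per_obs), 2))   # pulled far above the typical scale

# Dropping the outlier row brings the aggregate back to a sensible range.
trimmed = [v for v in per_obs if v <= 1000]
print(round(rms(trimmed), 2))
```

This is why removing just two rows out of 10000 changes the aggregated picture so dramatically: squaring amplifies the single extreme value.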
Yes, it is still not the best model; in fact it is still the worst. But it is no longer such a huge outlier.
Overall, it seems that CDML occasionally produces extremely bad predictions, and when we aggregate those, the model looks like it performs badly in general.
Should we continue investigating by simulation run? Should we get rid of the outliers?